
[MRG] ENH: K-Means SMOTE implementation #435

Merged · 42 commits into scikit-learn-contrib:master on Jun 12, 2019

Conversation

@StephanHeijl (Contributor) commented Jun 22, 2018

What does this implement/fix? Explain your changes.

This pull request implements K-Means SMOTE, as described in Oversampling for Imbalanced Learning Based on K-Means and SMOTE by Last et al.

Any other comments?

The density estimation function has been changed slightly from the reference paper, as the power term yielded very large numbers, which caused the weighting to favour a single cluster.
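
For context, a minimal sketch of the paper's density-based cluster weighting, where `exponent` is the power term discussed above (the reference implementation defaults it to the number of features, per the discussion later in this thread). The function name and shape are illustrative only; this is not the code in this PR:

```python
import numpy as np
from scipy.spatial.distance import pdist

def cluster_sampling_weights(minority_clusters, exponent):
    """Weight each filtered cluster by its sparsity, per the paper.

    `minority_clusters` is a list of 2D arrays, each holding the
    minority samples of one filtered cluster (assumed to contain at
    least two samples). With a large `exponent`,
    mean_distance ** exponent explodes and a single cluster ends up
    with nearly all of the weight -- the issue described above.
    """
    sparsity = []
    for X_min in minority_clusters:
        mean_distance = pdist(X_min).mean()   # mean pairwise Euclidean distance
        density = len(X_min) / mean_distance ** exponent
        sparsity.append(1.0 / density)
    sparsity = np.asarray(sparsity)
    return sparsity / sparsity.sum()          # fraction of new samples per cluster
```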

codecov bot commented Jun 22, 2018

Codecov Report

Merging #435 into master will decrease coverage by 2.43%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #435      +/-   ##
==========================================
- Coverage    98.9%   96.47%   -2.44%     
==========================================
  Files          84       85       +1     
  Lines        5202     5329     +127     
==========================================
- Hits         5145     5141       -4     
- Misses         57      188     +131
Impacted Files Coverage Δ
imblearn/over_sampling/tests/test_kmeans_smote.py 100% <100%> (ø)
imblearn/over_sampling/_smote.py 97.82% <100%> (+0.55%) ⬆️
imblearn/over_sampling/__init__.py 100% <100%> (ø) ⬆️
imblearn/utils/estimator_checks.py 90.87% <100%> (-6.22%) ⬇️
imblearn/keras/tests/test_generator.py 8.62% <0%> (-91.38%) ⬇️
imblearn/tensorflow/_generator.py 34.28% <0%> (-65.72%) ⬇️
imblearn/keras/_generator.py 40.35% <0%> (-57.9%) ⬇️
imblearn/over_sampling/tests/test_smote_nc.py 94.3% <0%> (-5.7%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c630df3...c3a1502. Read the comment docs.

@chkoar (Member) commented Jun 22, 2018

@felix-last, as the author and creator of the algorithm, are you willing to review this PR?

@StephanHeijl (Contributor, Author):

@chkoar @felix-last Thanks in advance for taking a look at this. It must be said that my initial research was somewhat poor, which caused me to overlook the existing implementation by the original author. As such, this was programmed without actually looking at that code. I nevertheless thought it valuable to contribute this back to the project.

@felix-last:

Unfortunately I don't have time for a detailed review. Just a few comments after glancing over it:

  • The exponent used for the density computation is actually a hyperparameter of the algorithm described in the paper. In our implementation, it defaults to the number of features, but can be set arbitrarily by the user.
  • In the original implementation, k_neighbors is decreased when it is larger than the number of samples in a given cluster (see the sketch below). This enables the user to specify, for example, k_neighbors=float('inf') to apply SMOTE to interpolate among all samples of a cluster.
  • This implementation does not cover the limit cases of SMOTE and random oversampling described in section 3.2 of the paper.

I'd like to point out that our implementation also complies with the imbalanced-learn API.
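
To illustrate the second point above, a hypothetical helper showing the clamping behaviour described; the name and signature are illustrative, not the original implementation's identifiers:

```python
def effective_k_neighbors(k_neighbors, n_cluster_samples):
    # SMOTE can use at most n_cluster_samples - 1 neighbors within a
    # cluster, so an oversized value (even float('inf')) is clamped.
    return int(min(k_neighbors, n_cluster_samples - 1))

effective_k_neighbors(float('inf'), 30)  # -> 29: interpolate among all samples
effective_k_neighbors(5, 30)             # -> 5: unchanged
```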

@StephanHeijl (Contributor, Author) commented Jun 26, 2018

@felix-last Sincere thanks for looking it over. I'll be happy to make these changes and push them to the project, provided @chkoar would still like it included.

@glemaitre (Member):

@StephanHeijl we are fine with including new methods (I would be thrilled to get something working on categorical data :)), as long as we provide references.
Ping me when you have made the last changes.

@pep8speaks commented Jul 1, 2018

Hello @StephanHeijl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1222:5: E303 too many blank lines (2)

Line 60:25: E128 continuation line under-indented for visual indent
Line 61:25: E128 continuation line under-indented for visual indent
Line 69:1: E302 expected 2 blank lines, found 1

Comment last updated at 2019-06-12 13:20:16 UTC

@StephanHeijl changed the title from [WIP] K-Means SMOTE implementation to [MRG] K-Means SMOTE implementation on Jul 12, 2018
@StephanHeijl (Contributor, Author):

@glemaitre I took some time to work out the last kinks and implement the edge cases from the paper; I believe this can now be subjected to a thorough review.

@glemaitre (Member):

OK, I will try to have a look.

@glemaitre (Member):

The PR looks good in general. @chkoar, before carrying out a full review I have an API question.

I am split between:

(1) adding a new kind of SMOTE and adding 3 new parameters (as in this PR);
(2) deprecating kind for SMOTE-SVM and creating 2 new classes, SMOTESVM and SMOTEKMeans.

With (1), I am afraid that we will have too many parameters for the common case, which is painful for the user. With (2), we will have 3 classes instead of 1.

What are your thoughts, @chkoar?

@chkoar (Member) commented Jul 25, 2018

It seems that the SMOTE class is getting too much responsibility. Also, according to this article there are almost 100 published variations of SMOTE. So I believe that option two is the way to go. We have to take care not to duplicate code between the classes.

@glemaitre (Member):

OK so let's go for a base class BaseSMOTE to factorize code as much as possible.

@StephanHeijl I will make a PR, and then you will need to move the code into a new class which will inherit from the base class. It should be a minimal change but quite beneficial (see the sketch below).
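
A rough sketch of the layout being proposed (class names as they eventually landed in imblearn; the bodies are placeholders, not the PR's actual code):

```python
class BaseSMOTE:
    """Shared SMOTE machinery factorized out of the variants."""

    def __init__(self, sampling_strategy='auto', random_state=None,
                 k_neighbors=5):
        self.sampling_strategy = sampling_strategy
        self.random_state = random_state
        self.k_neighbors = k_neighbors

    def _make_samples(self, *args, **kwargs):
        """Interpolate synthetic samples between neighbors."""

class SMOTE(BaseSMOTE):
    """Plain SMOTE: only the common parameters."""

class SVMSMOTE(BaseSMOTE):
    """Variant adding an svm_estimator parameter."""

class KMeansSMOTE(BaseSMOTE):
    """This PR's variant: clusters first, then applies SMOTE within
    sufficiently balanced clusters."""
```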

@StephanHeijl (Contributor, Author):

@glemaitre Will do; once the PR drops I'll integrate it and move the code. I agree that this should make implementing SMOTE variants a lot more accessible.

@glemaitre (Member):

@StephanHeijl I think that you can start to have a look at #440.
I still have a couple of things to update, but basically you just have to inherit from BaseSMOTE and move the implementation into _sample.


svm_estimator : object, optional (default=SVC())
    If ``kind='svm'``, a parametrized :class:`sklearn.svm.SVC`
    classifier can be passed.

n_kmeans_clusters : int, optional (default=10)
Review comment (Member):

I think that we could make this a KMeans estimator instead.
It could be interesting to plug in a MiniBatchKMeans, for instance, which could be quicker to converge and more memory efficient. This could be the default, while still letting people pass a KMeans estimator if interested.

Reply (Contributor, Author):

Completely agree. MiniBatchKMeans is now the default, and a test has been included to ensure that using a "normal" KMeans instance also works.
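
To make the agreed API concrete, a usage sketch; the kmeans_estimator parameter name matches what was eventually merged, but treat the exact defaults as indicative:

```python
from sklearn.cluster import KMeans
from imblearn.over_sampling import KMeansSMOTE

# Default: leaving kmeans_estimator unset creates a MiniBatchKMeans
# internally, which is quicker to converge and memory efficient.
sm = KMeansSMOTE(random_state=0)

# A pre-configured "normal" KMeans instance can be passed instead.
sm = KMeansSMOTE(kmeans_estimator=KMeans(n_clusters=20), random_state=0)

# An int is accepted as shorthand for the number of clusters of the
# internal MiniBatchKMeans.
sm = KMeansSMOTE(kmeans_estimator=20, random_state=0)

# Resampling then follows the usual imbalanced-learn API:
# X_res, y_res = sm.fit_resample(X, y)
```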

@glemaitre (Member):

I merged #440, so you should be able to create the new class.
Once you make that quick change, I can do a full review.

@chkoar (Member) commented Mar 7, 2019

@StephanHeijl thanks. @felix-last @glemaitre do you have time for a review?

@chkoar requested a review from glemaitre on March 7, 2019 15:30
@felix-last left a comment:

As a sanity check, I tried creating some of the toy dataset plots from our paper using this implementation.

For some examples this implementation yields unwanted results. Of course, we couldn't expect results to be exactly the same, but it could also be a hint that something is wrong (or that I'm overlooking something or it's just an unlucky random seed). I tried translating our implementation's default parameters (which I used for the toy plots in the paper) to the corresponding parameters of this implementation.

Comparison: our implementation (left) vs. this implementation (right)

[Screenshot: side-by-side comparison of toy dataset plots, 2019-03-30]

After a quick glance, the code looks good to me.

Feel free to play around with the plotting script. Prerequisites are to install requirements.txt, install this PR's imblearn, and create a config.yml with a results_dir and a dataset_dir. Download the toy datasets into the datasets dir. For comparison, here are all the toy plots obtained from our implementation: kmeans-smote-figures.zip.

@StephanHeijl (Contributor, Author) commented Apr 1, 2019

@felix-last Thank you very much for taking the time to look into this! It appears that I passed a wrong parameter to the _make_samples function. I have resolved this, and the results that I gathered from your (very helpful) plotting script appear to be far more in line with the expected results. I have uploaded the results here: https://github.com/StephanHeijl/imblearn-kmeans-toydatasets . Sample:
[Image: corrected K-Means SMOTE toy dataset plot]

I have updated the tests and the script in accordance with this improvement.

@felix-last left a comment:

That looks more sane! Looks good to me now (but I didn't do a thorough code review).

@chkoar (Member) left a comment:

Apart from my latest comments, the only thing that is missing is some user guide documentation.

Use the parameter ``sampling_strategy`` instead. It will be removed
in 0.6.

"""
Review comment (Member):

At the end of the docstring we could add the reference to the paper and, after that, an interactive example. Check here, for instance.

Reply (Contributor, Author):

I added an interactive example that specifically demonstrates the utility of the KMeansSMOTE class: 3 blobs, with the positive class in the middle, the negative class on the outside, and a single negative sample in the middle blob. The example shows that after resampling no new samples are added in the middle blob. It was inspired by the following toy problem:
[Image: toy problem with three blobs]
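
For reference, a sketch of that example under the setup just described; the numbers and the expected count are illustrative of the shape of the merged docstring example, not a verbatim copy:

```python
import numpy as np
from sklearn.datasets import make_blobs
from imblearn.over_sampling import KMeansSMOTE

# Three blobs: a large positive blob in the middle, two small negative
# blobs on the outside, plus one stray negative sample in the middle.
X, y = make_blobs(n_samples=[100, 800, 100],
                  centers=[(-10, 0), (0, 0), (10, 0)],
                  random_state=0)
y = (y == 1).astype(int)            # middle blob = class 1, outer blobs = class 0
X = np.vstack([X, [[0.0, 0.0]]])    # the lone class-0 sample in the middle blob
y = np.append(y, 0)

X_res, y_res = KMeansSMOTE(random_state=42).fit_resample(X, y)

# The imbalanced middle cluster is filtered out, so all synthetic
# class-0 samples land in the outer blobs and the middle blob keeps
# its single class-0 point.
in_middle = (X_res[:, 0] > -5) & (X_res[:, 0] < 5)
print((y_res[in_middle] == 0).sum())  # expected: 1
```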

@StephanHeijl (Contributor, Author):

@chkoar I have attended to your proposed changes; I hope this covers enough of an interactive example.

@glemaitre added this to the 0.5 milestone on Jun 11, 2019
@chkoar (Member) commented Jun 12, 2019

@StephanHeijl could you please rebase your branch so we can finally merge this PR?

@chkoar changed the title from [MRG] K-Means SMOTE implementation to [MRG] ENH: K-Means SMOTE implementation on Jun 12, 2019
@glemaitre (Member):

I pushed to resolve the conflict. I will review, make a couple of style changes if necessary, and merge it.

@glemaitre merged commit aff4125 into scikit-learn-contrib:master on Jun 12, 2019
@StephanHeijl deleted the kmeans-smote branch on June 12, 2019 15:43